Development  and  Evaluation  of  a  Korean  Treebank  and  its  Application  to  NLP 


Chung-hye Han*, Na-Rae Han†, Eon-Suk Ko†, Martha Palmer‡

*Dept. of Linguistics
Simon Fraser University
8888 University Drive
Burnaby BC V5A 1S6, Canada
chunghye@sfu.ca

†Dept. of Linguistics
University of Pennsylvania
619 Williams Hall
Philadelphia, PA 19104, USA
{nrh,esko}@ling.upenn.edu

‡Dept. of Computer Information and Science
University of Pennsylvania
256 Moore School
Philadelphia, PA 19104, USA
mpalmer@linc.cis.upenn.edu

Abstract

This paper discusses issues in building a 54-thousand-word Korean Treebank using a phrase structure annotation, along with developing
annotation guidelines based on the morpho-syntactic phenomena represented in the corpus. Various methods that were employed for
quality control are presented. An evaluation of the quality of the Treebank and some of the NLP applications under development using
the Treebank are also presented.

1. Introduction

With growing interest in Korean language processing, numerous natural language processing (NLP) tools for Korean, such as part-of-speech (POS) taggers, morphological analyzers and parsers, have been developed. This progress was possible through the availability of large-scale raw text corpora and POS tagged corpora (ETRI, 1999; Yoon and Choi, 1999a; Yoon and Choi, 1999b). However, no large-scale bracketed corpora are currently available to the public, although efforts have been made to develop guidelines for syntactic annotation (Lee et al., 1996; Lee et al., 1997). As a step towards addressing this issue, we built a 54-thousand-word¹ Korean Treebank using a phrase structure annotation at the University of Pennsylvania, creating the Penn Korean Treebank. At the same time, we also developed annotation guidelines based on the morpho-syntactic phenomena represented in the corpus, over the period of Jan. 2000 and April 2001. The corpus that we used for the Korean Treebank consists of texts from military language training manuals. These texts contain information about various aspects of the military, such as troop movement, intelligence gathering, and equipment supplies, among others. This corpus is part of a Korean/English bilingual corpora that was used for a domain specific Korean/English machine translation project at the University of Pennsylvania. One of the main reasons for annotating this corpus was to train taggers and parsers that can be used for the MT project.

In this paper, we first discuss some issues in developing the guidelines for POS tagging and syntactic bracketing. We then detail the annotation process in §3, including various methods we used to detect and correct annotation errors. §4 presents some statistics on the size of the corpus. §5 discusses the results of the evaluation on the Treebank, and §6 presents some of the NLP applications we did so far using the Treebank.

¹ This word count is computed on tokenized texts and includes symbols.

2. Guideline development

The guiding principles employed in developing the annotation guidelines were theory-neutralness (whenever possible), descriptive accuracy and consistency. To this end, various existing knowledge sources were consulted, including theoretical linguistic literature on Korean, publications on Korean descriptive grammar, as well as research works on building tagged Korean corpora by such institutions as KAIST and ETRI (ETRI, 1999; Lee et al., 1996; Lee et al., 1997; Yoon and Choi, 1999a; Yoon and Choi, 1999b). Ideally, complete guidelines should be available to the annotators before annotation begins. However, linguistic problems posed by corpora are much more diverse and complicated than those discussed in theoretical linguistics or grammar books, and new problems surface as we annotate more data. Hence, our guidelines were revised, updated and enriched incrementally as the annotation process progressed. In cases where no agreement could be reached among several alternatives, the one most consistent with the overall guidelines was chosen, with the consideration that the annotated corpus may be converted to accommodate other alternatives when needed. In the next two subsections, we describe in more detail the main points of the POS tagging guidelines and syntactic bracketing guidelines.

2.1.  POS  tagging  and  morphological  analysis 

Korean is an agglutinative language with a very productive inflectional system. Inflections include postpositions, suffixes and prefixes on nouns, and tense morphemes, honorifics and other endings on verbs and adjectives. For this reason, a fully inflected lexical form in Korean has often been called a WORD-PHRASE (‘어절’). To accurately describe this characteristic of Korean morphology, each word-phrase is not only assigned a POS tag, but also annotated for morphological analysis. Our Treebank uses two major types of POS tags: 14 content tags and 15 function tags. For each word-phrase, the base form (stem) is given a content tag, and its inflections are each given a function tag. Word-phrases are separated by a space, and within a word-phrase, the base form and inflections are separated by a plus sign (+). In addition to POS tags, the tagset also includes 5 punctuation tags. An example of a tagged sentence is given in (1).²

(1) a. Raw text:

       자주       통신망을     운용한다.
       frequently com_net-Acc operate-Decl

       ‘(We) operate communications network frequently.’

    b. Tagged text:

       자주/ADV 통신망/NNC+을/PCA 운용/NNC+하/XSV+ㄴ다/EFN ./SFN
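
As an illustration only, the tagged-text format described above (word-phrases separated by spaces, morphemes joined by a plus sign, each morpheme followed by a slash and its tag) can be read with a few lines of Python. This is a minimal sketch and not the tooling used in the project: the function names are hypothetical, the sample string is our reading of example (1), and morphemes that themselves contain `+` are not handled.

```python
def parse_tagged_sentence(line):
    """Split a tagged sentence into word-phrases.

    Each word-phrase becomes a list of (morpheme, tag) pairs.
    Assumption: no morpheme contains a literal '+'.
    """
    word_phrases = []
    for token in line.split():            # word-phrases are separated by spaces
        morphemes = []
        for piece in token.split("+"):    # morphemes are joined by '+'
            form, _, tag = piece.rpartition("/")  # the tag follows the last '/'
            morphemes.append((form, tag))
        word_phrases.append(morphemes)
    return word_phrases


def tag_sequence(word_phrase):
    """Return the tag sequence of one word-phrase, e.g. 'NNC+PCA'."""
    return "+".join(tag for _, tag in word_phrase)


# Sample input modeled on example (1b).
sentence = "자주/ADV 통신망/NNC+을/PCA 운용/NNC+하/XSV+ㄴ다/EFN ./SFN"
parsed = parse_tagged_sentence(sentence)
for wp in parsed:
    print(tag_sequence(wp), wp)
```

The tag sequence of each word-phrase (one content tag followed by zero or more function tags) is the unit that the later quality-control steps operate on.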

The main criterion for tagging and also for resolving ambiguity is syntactic distribution: i.e., a word may receive different tags depending on the syntactic context in which it occurs. For example, ‘아까’ (some time ago) is tagged as a common noun (NNC) if it modifies another noun, and is tagged as an adverb (ADV) if it modifies a verb.

(2) a. 아까/ADV       가/VV+았/EPF+다/EFN
       some_time_ago go-Past-Decl

    b. 아까/NNC+의/PCA     약속/NNC
       some_time_ago-Gen promise

One important decision we had to make was whether to treat case postpositions and verbal endings as bound morphemes or as separate words. This decision has consequences for syntactic bracketing as well. If we were to annotate them as separate words, it would be only natural to bracket them as independent syntactic units, which project their own functional syntactic nodes. Although some may favor this approach as theoretically more sound, from a descriptive point of view, they are more like bound morphemes, in that they are rarely separated from stems in written form, and native speakers of Korean share the intuition that they can never stand alone meaningfully in either written or spoken form. To reflect this intuition, we have chosen to annotate the inflections as bound morphemes, assigning them each a function tag.

² NNC and NNX are noun tags, PAD, PCA and PAU are noun inflectional tags, ADV is an adverb tag, XSV is a verbalizing suffix tag, EFN is a sentence final ending tag, and SFN is a punctuation tag. For a detailed description of the tagset, see (Han and Han, 2001).

2.2.  Syntactic  bracketing 

The Penn Korean Treebank uses phrase structure annotation for syntactic bracketing. Similar phrase structure annotation schemes were also used by the Penn English Treebank (Marcus et al., 1993; Bies et al., 1995), the Penn Middle English Treebank (Kroch and Taylor, 1995) and the Penn Chinese Treebank (Xia et al., 2000b). This annotation is preferable to a pure dependency annotation because it can encode richer structural information. For instance, a phrase structure annotation readily encodes the following structural information, which dependency annotations typically do not: (i) phrasal level node labels such as VP and NP; (ii) explicit representation of empty arguments; (iii) distinction between complementation and adjunction; and (iv) use of traces for displaced constituents.

Although having traces and empty arguments may be controversial, it has been shown in (Collins, 1997; Collins et al., 1999) that such rich structural annotation is crucial in improving the efficiency of stochastic parsers that are trained on Treebanks. Moreover, it has been shown in (Rambow and Joshi, 1997) that a complete mapping from dependency structure to phrase structure cannot be done, although the other direction is possible. This means that a phrase structure Treebank can always be converted to a dependency Treebank if necessary, but not the other way around.

The bracketing tagset of our Treebank can be divided into four types: (i) POS tags for head-level annotation (e.g., NNC, VV, ADV); (ii) syntactic tags for phrase-level annotation (e.g., NP, VP, ADVP); (iii) function tags for grammatical function annotation (e.g., -SBJ for subject, -OBJ for object, -ADV for adjunct); and (iv) empty category tags for dropped arguments (*pro*), traces (*T*), and so on.

In addition to using function tags, arguments and adjuncts are distinguished structurally as well. If YP is an internal argument of X, then YP is in sister relation with X, as represented in (3a). If YP is an adjunct of X, then YP adjoins onto XP, a projection of X, as in (3b).

(3)      XP                  XP
        /  \                /  \
      YP    X             YP    XP
                                |
                                X

     (a) Argument        (b) Adjunct

The syntactic bracketing of example (1) is given in the first tree of Figure 1. This example contains an empty subject, which is annotated as (NP-SBJ *pro*). The object NP ‘통신망/NNC+을/PCA’ is assigned the -OBJ function tag, and since it is an argument of the verb, it is structurally a sister of the verb. The adverb ‘자주’ is an adjunct of the verb, and so it is adjoined to the VP, the phrasal projection of the verb.

(S (NP-SBJ *pro*)
   (VP (ADVP 자주/ADV)
       (VP (NP-OBJ 통신망/NNC+을/PCA)
           (VV 운용/NNC+하/XSV+ㄴ다/EFN)))
   ./SFN)

(S (NP-OBJ-1 권한/NNC+을/PCA)
   (S (NP-SBJ 누구/NPN+가/PCA)
      (VP (VP (NP-OBJ *T*-1)
              가지/VV+고/EAU)
          있/VX+지/EFN))
   ?/SFN)

Figure 1: Examples of syntactic bracketing

An example sentence with a displaced constituent is given in (4). In this example, the object NP ‘권한을’ appears before the subject, while its canonical position is after the subject. Displacement of argument NPs is called SCRAMBLING.

(4) 권한을         누가     가지고 있지?
    authority-Acc who-Nom have   be

    ‘Who has the authority?’

In our annotation in the second tree of Figure 1, the object is adjoined to the main clause (S), and leaves a trace (*T*) in its original position which is coindexed with it.

A potential cause for inconsistency is making the argument/adjunct distinction. To ensure consistency in this task, we extracted all the verbs and adjectives from the corpus, and created what we call a PREDICATE-ARGUMENT LEXICON, based on Korean dictionaries, usages in the corpus and our own intuition. This lexicon lists verbs and adjectives with their subcategorization frames. For instance, the verb ‘운용하’ (operate) is listed as a transitive verb requiring a subject and an object as obligatory arguments. We also have a notation for optional arguments for some verbs. For instance, in (5), it is not clear whether ‘학교에’ (to school) is an argument or an adjunct, whereas ‘어제’ (yesterday) and ‘우리는’ (we) seem to offer clear intuition as to their adjunct and argument status, respectively. This is resolved by listing such categories as a locative optional argument for ‘가’ (to go) in the predicate-argument lexicon.

(5) 우리는  어제      학교에     갔다.
    we-Top yesterday school-to go-Past-Decl

    ‘We went to school yesterday.’

In syntactic bracketing, while obligatory arguments are annotated with -SBJ or -OBJ function tag, if a sentence contains an optional argument, it is annotated with a -COMP function tag. Moreover, a missing obligatory argument is annotated as an empty argument, but a missing optional argument does not count as an empty argument.

Another potential cause for inconsistency is handling syntactically ambiguous sentences. To avoid such inconsistencies, we have classified the types of ambiguities, and specified the treatment of each type in the bracketing guidelines. For example, a subset of Korean adverbs can occur either before or after the subject. When the subject is phonologically empty, in principle, the empty subject can be marked either before or after the adverb without difference in meaning if there is no syntactic/contextual evidence for favoring one analysis over the other. In this case, to avoid any unnecessary inconsistencies, a ‘default’ position for the subject is specified and the empty subject is required to be put before the adverb. An example annotation is already given in Figure 1.³

3.  Annotation  process 

The annotation proceeded in three phases: the first phase was devoted to morphological analysis and POS tagging, the second phase to syntactic bracketing and the third phase to quality control.

3.1. Phase I: Morphological analysis and POS tagging

We used an off-the-shelf Korean morphological analyzer (Yoon et al., 1999) to facilitate the POS tagging and morphological analysis. We ran the entire corpus through this morphological analyzer and then automatically converted the output POS tags to the set of POS tags we had defined. We then hand-corrected the errors in two passes. The first pass took two annotators roughly two months to complete. During this period, various morphological issues from the corpus were discussed in weekly meetings and guidelines for annotating them were decided and documented. In the second pass, which took about a month after the completion of the first pass, each annotator double-checked and corrected the files annotated by the other annotator.

3.2.  Phase  II:  Syntactic  bracketing 

The syntactic bracketing also went through two passes. The first pass took about 6 months to complete by three annotators, and the second pass took about 4 months to complete by two annotators. In the second pass, the annotators double-checked and corrected the bracketing done during the first pass. Phase II took much longer than Phase I because all the syntactic bracketing had to be done from scratch. Moreover, there were far more syntactic issues to be resolved than morphological issues. As in Phase I, weekly meetings were held to discuss and investigate the syntactic issues from the corpus and annotation guidelines were decided and documented accordingly. The bracketing was done using the already existing emacs-based interface developed for Penn English Treebanking (described in (Marcus et al., 1993)), which we customized for Korean Treebanking. Using this interface helped to avoid bracketing mismatches and errors in syntactic tag labeling.

3.3.  Phase  III:  Quality  control 

In order to ensure accuracy and consistency of the corpus, the entire third phase of the project was devoted to quality control. During this period, several full-scale examinations on the whole corpus were conducted, checking for inconsistent POS tags and illegal syntactic bracketings. LexTract was used to detect formatting errors (Xia et al., 2000a).

³ See (Han et al., 2001) for the documentation of our syntactic bracketing guidelines.

3.3.1.  Correcting  POS  tagging  errors 

Errors in POS tagging can be classified into three types: (a) assignment of an impossible tag to a morpheme, (b) ungrammatical sequence of tags assigned to a word-phrase, and (c) wrong choice of a tag (sequence) candidate in the presence of multiple tag (sequence) candidates.

Type (a) was treated by compiling a tag dictionary for the entire list of morphemes occurring in the corpus. For closed lexical categories such as verbal endings, postposition markers and derivational suffixes, all entries were examined to ensure that they were assigned correct tags. For open-set categories such as nouns, adverbs, verbs and so on, only those word-tag combinations exhibiting a low frequency count were individually checked.
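
The tag-dictionary check for type (a) errors can be sketched as follows. This is an assumed reconstruction of the idea, not the scripts used in the project: the function names, the choice of closed-class tags and the `max_count` threshold are all hypothetical (the text above does not state what counts as a low frequency), and the toy data uses romanized forms for readability.

```python
from collections import Counter, defaultdict


def build_tag_dictionary(tagged_morphemes):
    """Map each morpheme form to the tags it received and their counts."""
    tag_dict = defaultdict(dict)
    for (form, tag), n in Counter(tagged_morphemes).items():
        tag_dict[form][tag] = n
    return tag_dict


def review_candidates(tag_dict, closed_tags, max_count=1):
    """List (form, tag, count) entries that a human should inspect.

    Closed-class entries are always reviewed; open-class entries are
    reviewed only when the form/tag combination is rare.
    max_count is a hypothetical threshold, not a value from the paper.
    """
    out = []
    for form, tags in tag_dict.items():
        for tag, n in tags.items():
            if tag in closed_tags or n <= max_count:
                out.append((form, tag, n))
    return sorted(out)


# Toy data: (form, tag) pairs as they might be read off the tagged corpus.
corpus = (
    [("hakkyo", "NNC")] * 3      # frequent noun reading
    + [("hakkyo", "ADV")]        # rare, suspicious reading
    + [("ul", "PCA")] * 2        # postposition (closed class)
    + [("ta", "EFN")] * 2        # sentence-final ending (closed class)
)
CLOSED_TAGS = {"PCA", "EFN", "XSV"}  # illustrative subset of function tags

for entry in review_candidates(build_tag_dictionary(corpus), CLOSED_TAGS):
    print(entry)
```

Reviewing closed classes exhaustively is cheap because they are small, while the frequency cut-off keeps the manual effort on open classes bounded.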

Treating type (b) required knowledge of Korean morphosyntax. First, a table of all tag sequences and their frequencies in the corpus was compiled, as shown in Table 1.

Those tag sequences found fewer than 3 times were all manually checked for their grammaticality, and corrected if found illegal. As a next step, a set of hand-crafted morphotactic rules was created in the form of regular expressions. Starting from the most rigorous patterns, we checked the tag sequences against the patterns already incorporated in the set of grammatical morphotactic rules, expanding the set as needed or invalidating a tag sequence according to the outcome.
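
A minimal sketch of this two-step check is given below, under stated assumptions. The frequency table mirrors the columns of Table 1 (rank, count, count%, cumulative total%, entry), except that tied entries receive consecutive ranks here. The regular expressions in `MORPHOTACTIC_RULES` are hypothetical stand-ins: the actual hand-crafted rules are not listed in this paper.

```python
import re
from collections import Counter


def frequency_table(tag_sequences):
    """Return rows (rank, count, count%, cumulative%, sequence), most frequent first."""
    counts = Counter(tag_sequences)
    total = sum(counts.values())
    rows, cumulative = [], 0
    for rank, (seq, n) in enumerate(counts.most_common(), start=1):
        cumulative += n
        rows.append((rank, n, 100.0 * n / total, 100.0 * cumulative / total, seq))
    return rows


# Hypothetical morphotactic rules, each matching a whole tag sequence.
MORPHOTACTIC_RULES = [
    re.compile(r"NNC(\+XSF)?(\+PCA)?"),
    re.compile(r"VV(\+EPF)?\+(EFN|ECS|ENM)"),
    re.compile(r"ADV"),
    re.compile(r"SFN"),
]


def is_licensed(seq):
    """True if some rule matches the entire tag sequence."""
    return any(rule.fullmatch(seq) for rule in MORPHOTACTIC_RULES)


def sequences_to_review(rows, min_count=3):
    """Sequences seen fewer than min_count times, or not licensed by any rule."""
    return [seq for _, n, _, _, seq in rows if n < min_count or not is_licensed(seq)]


# Toy corpus of word-phrase tag sequences.
observed = (
    ["NNC"] * 5 + ["NNC+PCA"] * 4 + ["SFN"] * 3 + ["VV+EPF+EFN"] * 3
    + ["NNC+XSV+EPF+EFN+PCA"] + ["PCA+NNC"] * 3
)
table = frequency_table(observed)
for row in table:
    print("%d\t%d\t%.2f\t%.2f\t%s" % row)
print(sequences_to_review(table))
```

A sequence flagged by the rules is either an annotation error to be corrected or evidence that the rule set needs to be expanded, which matches the iterative procedure described above.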

Type (c), assignment of a wrong tag in the case of ambiguity, cannot be handled by looking at the morphemes by themselves, but the syntactic context must be considered: therefore this type of problem was treated along with other illegal syntactic structures.

3.3.2.  Correcting  illegal  syntactic  structures 

To correct errors in syntactic bracketing, we targeted each local tree structure (parent node + daughter nodes). To do this, all local tree structures were extracted in the form of context-free rules (Table 2). For local trees with a lexical daughter node, the lexical information was ignored and only POS information on the node was listed in the rule.
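
The extraction of local trees as context-free rules can be illustrated with the following sketch. It is not the project's own code: the reader assumes that no token contains a literal parenthesis, the function names are hypothetical, and the sample tree is our reading of the first tree in Figure 1. Lexical daughters are reduced to their tag sequence, and empty categories such as `*pro*` are kept as they are.

```python
def read_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. Assumption: tokens contain no '(' or ')'.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(parse())
            pos += 1                      # consume ')'
            return (label, children)
        leaf = tokens[pos]
        pos += 1
        return leaf

    return parse()


def leaf_category(leaf):
    """Drop lexical material: '통신망/NNC+을/PCA' becomes 'NNC+PCA'."""
    if "/" not in leaf:
        return leaf                       # empty category such as *pro*
    return "+".join(piece.rpartition("/")[2] for piece in leaf.split("+"))


def local_rules(tree):
    """Yield one 'PARENT -> DAUGHTERS' rule per local tree, top-down."""
    label, children = tree
    rhs = [c[0] if isinstance(c, tuple) else leaf_category(c) for c in children]
    yield f"{label} -> {' '.join(rhs)}"
    for child in children:
        if isinstance(child, tuple):
            yield from local_rules(child)


tree_text = (
    "(S (NP-SBJ *pro*) (VP (ADVP 자주/ADV) "
    "(VP (NP-OBJ 통신망/NNC+을/PCA) (VV 운용/NNC+하/XSV+ㄴ다/EFN))) ./SFN)"
)
rules = list(local_rules(read_tree(tree_text)))
for rule in rules:
    print(rule)
```

Counting the resulting rule strings over the whole corpus would give a frequency table of the kind shown in Table 2.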

The next step taken was to define the set of context-free rules for Korean. For each possible intermediate node label (phrasal categories such as S, NP, VP and a few lexical categories such as VV and VJ) on the lefthand side of the rule, its possible descendant node configuration was defined as a regular expression, as seen in (6):

(6) a. VP (shown in part):

       (NP-OBJ(-FV)? | NP-COMP(-FV)? | S-COMP | S-OBJ )+ VV\S*

    b. VV:

       NNC(\+XSF)?\+XSV | ^VV\S* VV\S*$ | (VV )*(ADCP )?VV

Example (6a) shows that a local tree with VP as the parent node can have as its daughter nodes any number of NP-OBJ, NP-COMP, S-COMP or S-OBJ nodes followed by a VV node, which is the head.
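
To make the procedure concrete, the sketch below checks extracted rules against per-label regular expressions. The two patterns are a simplified adaptation of (6), not a transcription: optional function-tag suffixes are omitted, whitespace is made explicit, and only the first alternative of (6b) is kept. The `check` function and its three return values are hypothetical.

```python
import re

# Simplified adaptation of (6): one pattern per parent label,
# matched against the space-separated daughters of a local tree.
RHS_PATTERNS = {
    "VP": re.compile(r"(?:(?:NP-OBJ|NP-COMP|S-COMP|S-OBJ) )+VV\S*"),
    "VV": re.compile(r"NNC(?:\+XSF)?\+XSV\S*"),
}


def check(rule):
    """Classify a 'PARENT -> DAUGHTERS' rule as 'ok', 'illegal' or 'undefined'."""
    lhs, rhs = rule.split(" -> ", 1)
    pattern = RHS_PATTERNS.get(lhs)
    if pattern is None:
        return "undefined"                # no pattern written for this label yet
    return "ok" if pattern.fullmatch(rhs) else "illegal"


for rule in [
    "VP -> NP-OBJ VV",        # object followed by the verbal head
    "VP -> NP-OBJ VV+EAU",    # head realized as a tagged leaf
    "VP -> VP",               # redundant phrasal projection
    "VV -> NNC+XSV+EFN",      # noun stem plus verbalizing suffix
    "NP -> NNC",              # label not covered by this sketch
]:
    print(rule, "=>", check(rule))
```

Because (6a) is shown only in part, a rule such as `VP -> ADVP VP` would be reported as illegal by this sketch even though adjunction to VP is a legitimate configuration; a full rule set would include alternatives for it.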


As with the case of word-internal tag sequences, the most frequent context-free rules were examined and incorporated into the set of rules first, and this set gradually grew as more and more rules were examined and either included in the rule set or rejected, to be corrected later. As a result, a large number of illegal syntactic bracketings were identified and corrected. Particularly frequent types of syntactic tagging errors were: (a) redundant phrasal projections (i.e. VP -> VP), (b) missing phrasal projections, and (c) misplaced or ill-scoped modifying elements such as relative clauses and adverbial phrases/clauses.

3.3.3.  Corpus  search 

We compiled a list of error-prone or difficult syntactic constructions that had been observed to be troublesome and confusing to annotators, and used corpus search tools (Randall, 2000) to extract sentence structures containing each of them from the Treebank. Each set of extracted structures was then examined and corrected. The list of constructions we looked at in detail includes relative clauses, complex noun phrases, light verb constructions, complex verbs, and coordinate structures. By doing a construction-by-construction check of the annotation, we were able to not only correct errors but also enhance the consistency of our annotation.

4.  Statistics  on  the  size  of  the  corpus 

In this section, we present some quantitative aspects of the Penn Korean Treebank corpus. The corpus is a relatively small one with 54,528 words and 5,083 sentences, averaging 9.158 words per sentence. A total of 10,068 word types are found in the corpus, therefore the measured type/token ratio (TTR) is rather high at 0.185. However, for languages with rich agglutinative morphology such as Korean, even higher type/token ratios are not uncommon. For comparison, a comparably sized portion (54,547 words) of the ETRI corpus, an annotated corpus with POS tags, was selected and analyzed.⁴ This set contained 19,889 word types, almost double the size of that of the Penn Korean Treebank, as shown in Table 3.


word        token     type      type/token ratio
Treebank    54,528    10,068    0.185
ETRI        54,547    19,889    0.364

morpheme    token     type      type/token ratio
Treebank    93,148    3,555     0.038
ETRI        101,100   8,734     0.086

Table 3: Type/token ratios of two corpora
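
The ratios in Table 3 follow directly from the token and type counts (ratio = types / tokens). The short sketch below recomputes them from the reported counts; the variable and function names are ours, and the comparison tolerance of 0.001 is chosen to allow for rounding in the published figures.

```python
# (unit, corpus) -> (tokens, types, ratio as reported in Table 3)
REPORTED = {
    ("word", "Treebank"): (54528, 10068, 0.185),
    ("word", "ETRI"): (54547, 19889, 0.364),
    ("morpheme", "Treebank"): (93148, 3555, 0.038),
    ("morpheme", "ETRI"): (101100, 8734, 0.086),
}


def type_token_ratio(tokens, types):
    """Number of distinct types divided by the number of tokens."""
    return types / tokens


computed = {
    key: type_token_ratio(tokens, types)
    for key, (tokens, types, _) in REPORTED.items()
}

for (unit, corpus), ratio in computed.items():
    reported = REPORTED[(unit, corpus)][2]
    print(f"{unit:9s} {corpus:9s} computed={ratio:.4f} reported={reported:.3f}")
```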


Taking individual morphemes, rather than words in their fully inflected forms, as the evaluation unit, the ratio becomes much smaller: the Penn Korean Treebank yields a morpheme type/token ratio of 0.038 (93,148 tokens and 3,555 types). Compared to the same portion of the ETRI corpus, we can see that the Penn Korean Treebank still shows a lower ratio: the ETRI corpus showed a morpheme type/token ratio of 0.086 (101,100 morpheme tokens and 8,734 unique morpheme types).

⁴ Total of 12 files: essay01.txt, expl10.txt, expl34.txt, news02.txt, newsp05.txt, newsp12.txt, newsp15.txt, newsp16.txt, novel03.txt, novel13.txt, novel15.txt and novel19.txt. For fair comparison, the POS annotated text was re-tokenized to suit the Penn Korean Treebank standards.

Rank    Count    Count%    Total%    Entry
1       8647     15.85     15.85     NNC
2       5606     10.28     26.14     NNC+PCA
3       5083     9.32      35.46     SFN
...
221     1        0.00      99.99     NNC+XSF+CO+EPF+ENM
221     1        0.00      100       NNC+XSV+EPF+EFN+PCA

Table 1: Frequency of tag sequences

Rank    Count    Count%    Total%    Entry
1       5993     7.72      7.72      S -> NP-SBJ VP
2       4079     5.26      12.98     NP-SBJ -> *pro*
3       2425     3.12      16.11     ADVP -> ADV
...
1394    1        0.00      99.99     ADJP -> VJ+EPF+EFN+PAU
1394    1        0.00      100       ADJP -> S NP-ADV ADVP ADJP

Table 2: Frequency of context-free rules

These results suggest that the Penn Korean Treebank, as a domain-specific corpus in the military domain, is highly homogeneous and low in complexity at least in terms of its lexical content. The ETRI corpus, on the other hand, consists of texts from different genres including novels, news articles and academic writings, hence the higher counts of lexical entries per word token. In our future work, we hope to expand the Treebank corpus in order to achieve a broader and more general coverage.

5.  Evaluation 

For evaluating the consistency and accuracy of the Treebank, we used the Evalb software, which produces three bracketing metrics (bracketing precision, bracketing recall and number of crossing brackets) as well as tagging accuracy.

For the purposes of evaluation, we randomly selected 10% of the sentences from the corpus at the beginning of the project and saved them to a file. These sentences were then POS tagged and bracketed just like any other sentences in the corpus. After the first pass of syntactic bracketing, however, they were double annotated by two different annotators. We also constructed a Gold Standard annotation for these test sentences. We then ran Evalb on the two annotated files produced by the two different annotators to measure the inter-annotator consistency. Evalb was also run on the Gold Standard and the annotation file of the 1st annotator, and on the Gold Standard and the annotation file of the 2nd annotator, to measure the individual annotator accuracy. Table 4 shows the accuracy of each annotator compared to the Gold Standard under the 1st/gold and 2nd/gold column headings, and the inter-annotator consistency under the 1st/2nd column heading. It shows that all the measures are well over 95%, with tagging accuracy reaching almost 100%. These measures indicate that the quality of the Treebank is more than satisfactory.


               Consistency    Accuracy
               1st/2nd        1st/gold    2nd/gold
Recall         96.60          97.69       98.84
Precision      97.97          98.89       98.84
No Crossing    95.89          97.57       97.53
Tagging        99.72          99.99       99.77

Table 4: Inter-annotator consistency and accuracy of the Treebank
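
For readers unfamiliar with these measures, the sketch below shows how bracketing precision, recall and crossing brackets are conventionally computed for one sentence from two sets of labeled spans. It is a simplified illustration, not the Evalb program itself: Evalb applies further conventions (for example, a parameter file controlling which labels and tokens are ignored) that are not modeled here, and the example spans are invented for a three-word sentence.

```python
from collections import Counter


def bracket_scores(gold, test):
    """Labeled bracketing precision and recall for one sentence.

    gold, test: lists of (label, start, end) constituents.
    """
    matched = sum((Counter(gold) & Counter(test)).values())
    precision = matched / len(test) if test else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall


def crossing_brackets(gold, test):
    """Number of test brackets that partially overlap some gold bracket."""
    gold_spans = {(s, e) for _, s, e in gold}
    crossing = 0
    for _, s, e in test:
        if any(gs < s < ge < e or s < gs < e < ge for gs, ge in gold_spans):
            crossing += 1
    return crossing


# Invented example: the two annotations disagree on one VP span.
gold = [("S", 0, 3), ("VP", 0, 3), ("ADVP", 0, 1), ("VP", 1, 3), ("NP-OBJ", 1, 2)]
test = [("S", 0, 3), ("VP", 0, 3), ("ADVP", 0, 1), ("VP", 0, 2), ("NP-OBJ", 1, 2)]

precision, recall = bracket_scores(gold, test)
print(precision, recall, crossing_brackets(gold, test))
```

When two annotators are compared with each other rather than with the Gold Standard, one file has to play the role of `gold`; which one was chosen for the 1st/2nd column is not stated, so precision and recall in that column should be read with this in mind. The "No Crossing" row is the percentage of sentences for which the crossing count is zero.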

Most of the inter-annotator inconsistencies belonged to one of the following types:

• In coordinated sentences with an empty subject and an empty object, whether the level of coordination is VV, VP or S;

• Whether a sentence has an empty object argument or not;

• Whether a noun modified by a clause is a relative clause construction or a complex NP;

• Whether a verb is a light verb or a regular verb;

• In a complex sentence in which the subject of the matrix clause and the subordinate clause are coreferential, whether a topic marked NP is the subject of the matrix clause or the subordinate clause;

• In a sentence with a topic marked object NP and an empty subject, whether the object NP has undergone scrambling over the empty subject or not;

• For an NP with an adverbial postposition⁵, whether it is an argument or an adjunct;

• When an adverb precedes another adverb which in turn precedes a verb, whether the first adverb modifies the adverb or the verb.

After the evaluation was done, as a final cleanup of the Treebank, we used corpus search tools to extract and correct structures that could potentially lead to the types of inconsistencies described above.

6.  Applications  of  the  Treebank 

6.1.  Morphological  tagger 

We trained a morphological tagger on 91% of the 54K Korean Treebank and tested it on 9% of the Treebank (Han, 2002). The tagger/analyzer takes raw text as input and returns a lemmatized disambiguated output in which for each word, the lemma is labeled with a POS tag and the inflections are labeled with inflectional tags. This system is based on a simple statistical model combined with a corpus-driven rule-based approach, comprising a trigram-based tagging component and a morphological rule application component.

The tagset consists of possible tag sequences (e.g., NNC+PCA, VV+EPF+EFN) extracted from the Treebank. Given an input sentence, each word is first tagged with a tag sequence. Tags for unknown words are then updated using inflectional templates extracted from the Treebank. A few example templates are listed in Table 5.


VV+EPF+EFN 

VV+EPF+ECS 

VV+EPF+ENM

VV+EPF+ECS 

VV+EPF+ECS 

VV+EPF+ECS 

Table  5:  Example  of  Inflectional  Templates 

Using an inflection dictionary and a stem dictionary extracted from the Treebank, the lemma and the inflections are then identified, splitting the inflected form of the word into its constituent stem and affixes. This approach yielded 95.01%/95.30% recall/precision on the test data. An example input and output are shown below. The morphological tagger assigns POS tags and also splits the inflected form of the word into its constituent stem and inflections.

(7)  a.  Input: 

*11 7]-  4- 

b.  Output: 

x-||  /NPN+  7  )-/PC  A 

-af^/NNC 

74-SJ-/NNC+4-/PCA 

$-3L%}/VV+  5j/EPF+4rT-j  tf/EFN 

,/SFN 
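
The dictionary-based splitting step can be pictured with the following toy sketch. It is an assumed simplification and not the tagger of (Han, 2002): given a word and the tag sequence already assigned to it, it searches for a stem from a stem dictionary followed by inflections from an inflection dictionary whose tags match that sequence. The dictionaries, the function name and the example words are hypothetical, and contracted forms (where stem and ending fuse into one syllable) are deliberately not handled.

```python
def segment(word, tags, stems, inflections):
    """Split word into (form, tag) pairs matching the tag sequence, or return None.

    stems, inflections: dicts mapping a surface form to the set of tags it can carry.
    """
    tag_list = tags.split("+")

    def split_rest(rest, rest_tags):
        if not rest_tags:
            return [] if rest == "" else None
        for end in range(len(rest), 0, -1):        # prefer longer inflections
            form = rest[:end]
            if rest_tags[0] in inflections.get(form, ()):
                tail = split_rest(rest[end:], rest_tags[1:])
                if tail is not None:
                    return [(form, rest_tags[0])] + tail
        return None

    for end in range(len(word), 0, -1):            # prefer longer stems
        stem = word[:end]
        if tag_list[0] in stems.get(stem, ()):
            tail = split_rest(word[end:], tag_list[1:])
            if tail is not None:
                return [(stem, tag_list[0])] + tail
    return None


# Hypothetical miniature dictionaries.
STEMS = {"통신망": {"NNC"}, "먹": {"VV"}, "가": {"VV"}}
INFLECTIONS = {"을": {"PCA"}, "었": {"EPF"}, "았": {"EPF"}, "다": {"EFN"}}

print(segment("통신망을", "NNC+PCA", STEMS, INFLECTIONS))
print(segment("먹었다", "VV+EPF+EFN", STEMS, INFLECTIONS))
print(segment("갔다", "VV+EPF+EFN", STEMS, INFLECTIONS))  # contracted form: not found
```

The last call returns `None` because the surface form fuses the stem and the past-tense morpheme; recovering such cases needs morphological rules of the kind mentioned in the text, which this sketch leaves out.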


⁵ Adverbial postpositions correspond to English prepositions in function, e.g., ‘-에게’ (to), ‘-로부터’ (from), ‘-에’ (in), etc.


6.2.  Parser 

The  Treebank  has  been  used  to  train  a  statistical  parser 
using  a  probabilistic  Tree  Adjoining  Grammar  (TAG) 
model  (Sarkar,  2002).  The  parser  uses,  as  training  data, 
TAG  derivations  automatically  extracted  from  the  Treebank 
with  Xia’s  (Xia  et  ah,  2000a)  LexTract. 

In a probabilistic TAG (Schabes, 1992; Resnik, 1992),
each word in the input sentence is assigned a set of trees,
called elementary trees, that it has selected in the training
data. Each elementary tree has some word (called the
ANCHOR) in the input sentence as a node on the frontier.
A derivation proceeds as follows: one elementary tree is
picked to be the start of the derivation. Elementary trees
are then added to this derivation using the operations of
substitution and adjunction. Each tree added in this step
can be recursively modified via subsequent operations of
substitution and adjunction. Once all the words in the input
sentence have been recognized, the derivation is complete.
The parser outputs a derivation tree, a record of how
elementary trees are combined to generate a sentence, and
also a derived tree (read off from the derivation tree), which
corresponds to the bracketed structure of the sentence.
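The two TAG operations can be made concrete with a small sketch. It is not the parser described here and has no probabilities: it only shows substitution and adjunction on toy trees. The `Node` class, the tree shapes, and the placeholder adverb `<adv>` are assumptions made for illustration; the romanized anchors are borrowed from the example discussed with Figure 2.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    subst: bool = False  # substitution site on the frontier
    foot: bool = False   # foot node of an auxiliary tree


def find(node: Node, pred: Callable[[Node], bool]) -> Optional[Node]:
    """Depth-first search for the first node satisfying pred."""
    if pred(node):
        return node
    for child in node.children:
        hit = find(child, pred)
        if hit is not None:
            return hit
    return None


def substitute(tree: Node, initial: Node) -> None:
    """Plug an initial tree into a matching substitution site."""
    site = find(tree, lambda n: n.subst and n.label == initial.label)
    if site is None:
        raise ValueError("no substitution site for " + initial.label)
    site.children, site.subst = initial.children, False


def adjoin(tree: Node, aux: Node) -> None:
    """Splice an auxiliary tree in at an internal node with the same label."""
    target = find(tree, lambda n: n.label == aux.label and not n.subst and not n.foot)
    foot = find(aux, lambda n: n.foot)
    if target is None or foot is None:
        raise ValueError("adjunction not possible")
    # The foot node inherits the subtree that used to hang under the target.
    foot.children, foot.foot = target.children, False
    target.children = aux.children


def bracket(node: Node) -> str:
    """Read off the derived tree as a bracketed string."""
    if not node.children:
        return node.label
    return "(" + node.label + " " + " ".join(bracket(c) for c in node.children) + ")"


# Initial tree anchored by the verb, with an NP substitution site.
s_tree = Node("S", [Node("NP", subst=True), Node("VP", [Node("V", [Node("pakwi+key")])])])
# Initial tree anchored by the noun.
np_tree = Node("NP", [Node("N", [Node("tayho+nun")])])
# Auxiliary tree anchored by a placeholder adverb, with foot node VP*.
adv_tree = Node("VP", [Node("ADV", [Node("<adv>")]), Node("VP", foot=True)])

substitute(s_tree, np_tree)
adjoin(s_tree, adv_tree)
derived = bracket(s_tree)

if __name__ == "__main__":
    print(derived)
```

The sequence of `substitute` and `adjoin` calls plays the role of the derivation, and the string returned by `bracket` plays the role of the derived tree.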

The parser is interfaced to the morphological tagger described
in §6.1 to avoid the sparse data problems likely to
be caused by the highly agglutinative nature of words in
Korean. The parser is able to use information from the component
parts of the words that the morphological tagger provides.
With this method, we achieved 75.7% accuracy of
TAG derivation dependencies on the test set from the Treebank.
An example parser output of a derivation is given
in Figure 2. The index numbers in the first column and
the last column of the table represent the dependencies between
words. For instance, ‘모든 (motun)’ has index 0
and is dependent on the word with index 2, ‘대호+는
(tayho+nun)’. ‘바뀌+게 (pakwi+key)’ is the root of the
derivation, marked by TOP. The morpheme boundaries in
the words in the 2nd column are marked with a + sign. The
3rd column contains the tag sequence of the word, and the
4th column lists the name of the elementary tree anchored
by the word.
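To make the dependency reading of Figure 2 concrete, the sketch below encodes the first and last columns of the figure as a head map and scores it against a gold-standard map. The `gold` values are invented purely to exercise the metric, and the definition of accuracy used (the fraction of words whose head matches the gold head) is an assumption; the exact evaluation procedure is not spelled out here.

```python
from typing import Dict, Union

Head = Union[int, str]

# Index -> head index, read off the last column of Figure 2 ("TOP" marks the root).
predicted: Dict[int, Head] = {0: 2, 1: 2, 2: 6, 3: 6, 4: 5, 5: 6, 6: "TOP", 7: 6}

# Hypothetical gold standard, invented here only to exercise the metric.
gold: Dict[int, Head] = {0: 2, 1: 2, 2: 6, 3: 6, 4: 5, 5: 6, 6: "TOP", 7: 5}


def find_root(heads: Dict[int, Head]) -> int:
    """Return the index of the single word marked TOP."""
    roots = [i for i, h in heads.items() if h == "TOP"]
    if len(roots) != 1:
        raise ValueError("expected exactly one root")
    return roots[0]


def dependency_accuracy(pred: Dict[int, Head], ref: Dict[int, Head]) -> float:
    """Fraction of words whose predicted head equals the reference head."""
    if pred.keys() != ref.keys():
        raise ValueError("index sets differ")
    correct = sum(1 for i in ref if pred[i] == ref[i])
    return correct / len(ref)


if __name__ == "__main__":
    print("root index:", find_root(predicted))
    print("accuracy:", dependency_accuracy(predicted, gold))
```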

7.  Conclusion 

We have described in detail the annotation process as
well as the methods we used to ensure inter-annotator
consistency and annotation accuracy in creating a 54K word
Korean Treebank.6 We have also discussed the major principles
employed in developing POS tagging and syntactic
bracketing guidelines. Despite the small size of the Treebank,
we were able to successfully train a morphological
tagger (95.78%/95.39% precision/recall) and a parser
(73.45% dependency accuracy) using the data from the
Treebank. They were incorporated into a Korean/English
machine translation system which was jointly developed by
the University of Pennsylvania and CoGenTex (Han et al.,
2000; Palmer et al., 2002).


6  Information on our Penn Korean Treebank can be found
at www.cis.upenn.edu/~xtag/koreantag/, including
POS tagging and syntactic bracketing guidelines as well as a
sample bracketed file.


Index  Word                  POS tag    Elem        Anchor  Node     Subst/Adjoin
                             (morph)    Tree        Label   Address  into (Index)
0      모든 (motun)          DAN        βNP*=1      anchor  root     2
1      [illegible]           NNC        βNP*=1      anchor  root     2
2      대호+는 (tayho+nun)   NNC+PAU    αNP=0       anchor  0        6
3      [illegible]           ADV        βVP*=25     anchor  1        6
4      24                    NNU        βNP*=1      anchor  0        5
5      [illegible]           NNX+PAD    βVP*=17     anchor  1        6
6      바뀌+게 (pakwi+key)   VV+ECS     αS-NPs=23   anchor  -        TOP
7      [illegible]           VX+EFN     βVP*=13     anchor  1        6
8      [illegible]           SFN        -           -       -        -

Figure  2:  Example  of  derivation  of  a  sentence  reported  by  the  statistical  parser 


We plan to release the Treebank in the near future, making
it available to the wider community. The corpus we
used for the Korean Treebank is originally from a Korean/English
parallel corpus, and we have recently finished
creating a Korean/English parallel Treebank by treebanking
the English side and aligning the two Treebanks. We
would like to expand the size and coverage of the corpus by
treebanking newswire corpora, employing as rigorous an
annotation methodology as we did for the 54K Treebank.
We hope to speed up the annotation process by automating
it as much as possible (cf. the approach described in
(Skut et al., 1997) for the NEGRA corpus
at the University of Saarbrücken), incorporating a parser as
well as a tagger into the annotation interface.